Categorizing ADC site page URLs to more easily analyze user engagement.
Page: The part of the URL after your domain name (the path) for content that someone has viewed on your website. For example, if someone views https://www.example.com/contact, then /contact is reported as the page in the Behavior reports.
User: An individual visitor to the site (tracked using browser cookies).
Sessions: A single visit to the website, consisting of one or more pageviews and any other interactions. The default session timeout is 30 minutes.
User % of Total: Users displayed as a percentage of the total users during the report period.
Pageviews: The number of times users view a page that has the Google Analytics tracking code inserted. This covers all page views, so if a user refreshes the page, or navigates away from the page and returns, each of these counts as an additional pageview.
Unique Pageviews: The number of sessions in which the page was viewed at least once. Whether a user viewed the page once or five times during a visit, it counts as just one unique pageview.
Entrances: The number of sessions that started on a specific web page or group of web pages, i.e. the first page that someone views during a session.
Bounce Rate: The percentage of sessions in which the user left the site after viewing just one page, regardless of how long they stayed on that page (total bounces divided by total sessions).
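As a rough illustration of how these metrics relate to one another, the sketch below derives them from a small, made-up hit log. The data frame `hits` and its column names are hypothetical and are not part of the Google Analytics export; Google Analytics computes these values itself, and its exact rules may differ in edge cases.

```r
# Hypothetical hit log: one row per pageview (made-up data for illustration)
hits <- data.frame(
  session = c("s1", "s1", "s1", "s2", "s3", "s3"),
  url     = c("https://www.example.com/contact",
              "https://www.example.com/about",
              "https://www.example.com/contact",
              "https://www.example.com/contact",
              "https://www.example.com/about",
              "https://www.example.com/about"),
  stringsAsFactors = FALSE
)

# Page: the part of the URL after the domain name
hits$page <- sub("^https?://[^/]+", "", hits$url)

# Pageviews: every view counts, including repeat views within a session
pageviews <- table(hits$page)

# Unique Pageviews: each page counts at most once per session
unique_pageviews <- table(unique(hits[, c("session", "page")])$page)

# Entrances: the first page viewed in each session
entrances <- table(hits$page[!duplicated(hits$session)])

# Bounce Rate: sessions with exactly one pageview divided by all sessions
views_per_session <- table(hits$session)
bounce_rate <- sum(views_per_session == 1) / length(views_per_session)
```

In this toy log, /contact has three pageviews but only two unique pageviews, because session s1 viewed it twice.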
We will use the code and function below to categorize the Google Analytics dataset. The function takes messy character data within a data frame and categorizes it based on a set of search strings. The inputs are the data frame, the name of the column containing the messy data, a list of search strings, and a list of category names (the two lists must be in the same order, so that each search string lines up with its category). Optionally, you can name the new column.
It is important to note that the order of the search strings matters when one string contains another. Each row is assigned to the first search string it matches, so “catalog” would also claim every “catalog/submit” page if it came first. You must therefore list the longer, more specific string (i.e. catalog/submit) before the shorter one. Additionally, make sure the order of the categories list matches the order of the search strings.
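The ordering can be checked before running the categorization. The sketch below is not part of the original function; the helper name `findShadowed` is hypothetical, and it assumes the search strings are plain text rather than regular expressions. It flags any search string that can never be reached because an earlier string in the list is contained in it and therefore claims the same rows first.

```r
# Hypothetical helper: report search strings that are unreachable because an
# earlier string in the list is a substring of them (first match wins).
findShadowed <- function(searchList) {
  searchList <- tolower(searchList)  # the categorization ignores case
  shadowed <- character(0)
  for (j in seq_along(searchList)) {
    earlier <- searchList[seq_len(j - 1)]
    # fixed = TRUE treats each search string as plain text, not as a regex
    hit <- vapply(earlier,
                  function(s) grepl(s, searchList[j], fixed = TRUE),
                  logical(1))
    if (any(hit)) shadowed <- c(shadowed, searchList[j])
  }
  shadowed
}

bad_order  <- c("catalog", "catalogsubmit", "news")
good_order <- c("catalogsubmit", "catalog", "news")

findShadowed(bad_order)   # flags "catalogsubmit", which is listed after "catalog"
findShadowed(good_order)  # nothing flagged
```

An empty result means no search string is hidden behind an earlier one. It does not check that the categories list is in the matching order.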
Source: https://github.com/lenwood
# List of search strings -- note that longer strings are listed before the shorter strings they contain
search <- c("news", "portals", "about", "catalogprofile", "catalogsubmit", "catalog", "training", "team", "home", "view", "submit", "profile")
# List of categories
categories <- c("News", "Portals", "About", "Summary", "Submit", "Cathome", "Training", "Team", "Home", "Dataset", "WhoMustSub", "Summary")
# Quickly categorize a data frame with a column of messy character strings.
# Replace "df" with your messy dataframe.
categorizeDF <- function(df, searchColName, searchList, catList, newColName="Category") {
  # create empty data frame to hold categories
  catDF <- data.frame(matrix(ncol=ncol(df), nrow=0))
  colnames(catDF) <- names(df)
  # add sequence so original order can be restored
  df$sequence <- seq_len(nrow(df))
  # iterate through the strings; each row goes to the first string it matches
  for (i in seq_along(searchList)) {
    rownames(df) <- NULL
    index <- grep(searchList[i], df[[searchColName]], ignore.case=TRUE)
    # skip strings with no matches: df[-integer(0),] would drop every remaining row
    if (length(index) == 0) next
    tempDF <- df[index,]
    tempDF$newCol <- catList[i]
    catDF <- rbind(catDF, tempDF)
    df <- df[-index,]
  }
  # OTHER category for unmatched rows
  if (nrow(df) > 0) {
    df$newCol <- "OTHER"
    catDF <- rbind(catDF, df)
  }
  # return to the original order & remove the sequence data
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL
  # remove row names
  rownames(catDF) <- NULL
  # set Category type to factor
  catDF$newCol <- as.factor(catDF$newCol)
  # rename the new column
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName
  catDF
}
# Replace "df" with messy dataframe
# Identify which column you want to categorize -- in our case with Google Analytics, we will be categorizing the "Page" column that contains messy URL strings. Additionally, you can name the new column that contains the categories (e.g. "Category").
sorted <- categorizeDF(df, "column name with messy data", search, categories, "new category column name")
###### TEST DATASET ######
# Remove slashes and other punctuation (including hyphens and periods) from the Page column. **** Not sure if this is necessary. Am trying to differentiate the single "/" as the ADC Homepage, and make it easier to identify search terms for the function below.
# Note: mutate_all() applies gsub() to every column, so decimal points in the numeric columns are stripped as well and all columns become character.
library(dplyr)
test_users_clean <- top_30_users %>%
  mutate_all(~ gsub("[[:punct:]]", "", .))
# Rename the home page as "home" in the data frame. **NOTE that for this particular dataset the home page is the top viewed page, so it sits in row [1]. If it is not the top viewed page, you will need to determine which row holds the homepage and put that row number in the brackets. *** Is there a better way to do this?? ***
test_users_clean$Page[1] <- "home"
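The homepage could also be renamed by value rather than by row position. The sketch below is one possible approach, not the original method: it strips punctuation from the Page column only, so numeric columns keep their decimal points, and it assumes the raw homepage path is a single "/", which becomes an empty string once punctuation is removed. The data frame `example_pages` and its values are made up for illustration.

```r
# Made-up stand-in for the raw Google Analytics export
example_pages <- data.frame(
  Page        = c("/catalog/", "/", "/about/", "/?page=0"),
  Bounce_Rate = c(0.247, 0.422, 0.582, 0.159),
  stringsAsFactors = FALSE
)

# Strip punctuation from the Page column only
example_pages$Page <- gsub("[[:punct:]]", "", example_pages$Page)

# The bare "/" is now an empty string, wherever it sits in the data frame
example_pages$Page[example_pages$Page == ""] <- "home"

example_pages$Page
```

Because the match is on the value, the result does not depend on the homepage being the most viewed page.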
### Categorize the page URLS in the Page column into larger categories using a function ###
## Create a list of search strings to sort through pages and a list of categories (these must be in the same order). **Order matters when one string contains another -- i.e. "catalog" also matches "catalog/submit" pages, so you must list the longer string first (i.e. catalog/submit).
# List of search strings
search <- c("news", "portals", "about","catalogprofile", "catalogsubmit", "catalog", "training", "team", "home", "view", "submit", "profile")
# List of categories
categories <- c("News", "Portals", "About", "Summary", "Submit", "Cathome", "Training", "Team", "Home", "Dataset", "WhoMustSub", "Summary")
## Create function [below] to categorize the messy "Page" column of the raw data frame.
# This function looks at a data frame column of messy character (or factor) data, and produces a new column of categorized data. The inputs are the data frame, the column name of the messy data, a list of search strings, a list of category names (these two must be in the same order), and you have the option of naming the new column.
# Function:
categorizeDF <- function(test_users_clean, searchColName, searchList, catList, newColName="Category") {
  # create empty data frame to hold categories
  catDF <- data.frame(matrix(ncol=ncol(test_users_clean), nrow=0))
  colnames(catDF) <- names(test_users_clean)
  # add sequence so original order can be restored
  test_users_clean$sequence <- seq_len(nrow(test_users_clean))
  # iterate through the strings; each row goes to the first string it matches
  for (i in seq_along(searchList)) {
    rownames(test_users_clean) <- NULL
    index <- grep(searchList[i], test_users_clean[[searchColName]], ignore.case=TRUE)
    # skip strings with no matches: a negative index of integer(0) would drop every remaining row
    if (length(index) == 0) next
    tempDF <- test_users_clean[index,]
    tempDF$newCol <- catList[i]
    catDF <- rbind(catDF, tempDF)
    test_users_clean <- test_users_clean[-index,]
  }
  # OTHER category for unmatched rows
  if (nrow(test_users_clean) > 0) {
    test_users_clean$newCol <- "OTHER"
    catDF <- rbind(catDF, test_users_clean)
  }
  # return to the original order & remove the sequence data
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL
  # remove row names
  rownames(catDF) <- NULL
  # set Category type to factor
  catDF$newCol <- as.factor(catDF$newCol)
  # rename the new column
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName
  catDF
}
# Call the function and create new data frame - using the raw data frame, the messy column you want to sort, the search and category lists, and name of the new column
sortedDF <- categorizeDF(test_users_clean, "Page", search, categories, "Category")
knitr::kable(sortedDF, format = "html")
| Page | Users | Sessions | Users_._of_Total | Pageviews | Unique_Pageviews | Entrances | Bounce_Rate | Category |
|---|---|---|---|---|---|---|---|---|
| home | 25436 | 42464 | 0440145 | 75951 | 55359 | 42443 | 0421957423 | Home |
| catalog | 4310 | 2280 | 0257363 | 12070 | 8734 | 2131 | 0247368421 | Cathome |
| catalog | 4130 | 416 | 0195397 | 33 | 26 | 19 | 0033653846 | Cathome |
| data | 3291 | 2306 | 0160785 | 19380 | 9507 | 2319 | 0273200347 | OTHER |
| catalogdata | 3114 | 3395 | 0139405 | 14923 | 7964 | 3298 | 0253608247 | Cathome |
| about | 2637 | 941 | 0123776 | 3942 | 3297 | 944 | 0582359192 | About |
| team | 1634 | 614 | 0110133 | 2554 | 2117 | 615 | 0684039088 | Team |
| submit | 1384 | 898 | 009936 | 3255 | 2395 | 901 | 0643652561 | WhoMustSub |
| page0 | 1174 | 1580 | 0090577 | 5362 | 3813 | 1582 | 0158860759 | OTHER |
| training | 1166 | 892 | 0083537 | 2245 | 1639 | 892 | 0515695067 | Training |
| publications | 1120 | 431 | 0077705 | 1701 | 1326 | 432 | 0744779582 | OTHER |
| share | 1060 | 528 | 0072758 | 4532 | 2706 | 529 | 0357954545 | OTHER |
| profile | 989 | 232 | 0068477 | 1609 | 1338 | 238 | 061637931 | Summary |
| qanda | 912 | 214 | 0064713 | 1430 | 1181 | 214 | 0570093458 | OTHER |
| january2019datasciencetrainingforarcticresearchers | 903 | 1004 | 0061441 | 1359 | 1191 | 1004 | 0815737052 | Training |
| datapage0 | 873 | 556 | 0058545 | 4704 | 2254 | 557 | 0303956835 | OTHER |
| catalogprofile | 799 | 193 | 0055914 | 1376 | 1121 | 188 | 0564766839 | Summary |
| proposals | 773 | 660 | 0053551 | 1187 | 1008 | 661 | 0762121212 | OTHER |
| homehtm | 735 | 800 | 0051402 | 936 | 817 | 800 | 056875 | Home |
| support | 729 | 121 | 0049463 | 1308 | 1004 | 122 | 058677686 | OTHER |
| dataplans | 685 | 384 | 0047672 | 932 | 827 | 384 | 0841145833 | OTHER |
| 2018datasciencetrainingforarcticresearchers | 649 | 639 | 0046015 | 982 | 876 | 639 | 0723004695 | Training |
| news201606datascienceopportunities | 629 | 371 | 0044488 | 810 | 733 | 371 | 0851752022 | News |
| upcomingdatasciencetrainingforarcticresearchers | 599 | 612 | 0043066 | 857 | 767 | 612 | 0823529412 | Training |
| catalogsubmit | 582 | 302 | 0041746 | 1657 | 1196 | 304 | 0523178808 | Submit |
| catalogportalspermafrost | 548 | 651 | 0040505 | 874 | 722 | 650 | 0769585253 | Portals |
| reconcilinghistoricalandcontemporarytrendsinterrestrialcarbonexchangeofthenorthernpermafrostzone | 546 | 844 | 0039355 | 1189 | 995 | 844 | 0808056872 | OTHER |
| viewdoi103334CDIAC00001V2017 | 522 | 562 | 0038272 | 769 | 619 | 562 | 0807829181 | Dataset |
| catalogshare | 521 | 399 | 0037263 | 1197 | 994 | 378 | 0483709273 | Cathome |
| categorynews | 512 | 197 | 0036317 | 822 | 672 | 197 | 0624365482 | News |
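With the categories assigned, the share of users per category can be computed, for instance as input to a chart. The sketch below is illustrative only: it rebuilds a few rows from the table above by hand in a data frame named `exampleDF` (a hypothetical name) and converts `Users` back to numeric, since `gsub()` returns character values.

```r
# A few rows copied by hand from the table above
exampleDF <- data.frame(
  Page     = c("home", "catalog", "about", "team", "homehtm", "catalogshare"),
  Users    = c("25436", "4310", "2637", "1634", "735", "521"),
  Category = c("Home", "Cathome", "About", "Team", "Home", "Cathome"),
  stringsAsFactors = FALSE
)

# Users is character after the punctuation was stripped, so convert it back
exampleDF$Users <- as.numeric(exampleDF$Users)

# Sum users within each category
byCategory <- aggregate(Users ~ Category, data = exampleDF, FUN = sum)

# Share of the summed users that falls in each category, as a percentage
byCategory$Percent <- round(100 * byCategory$Users / sum(byCategory$Users), 1)

# Largest categories first
byCategory[order(-byCategory$Users), ]
```

Because one user can view pages in several categories, these sums count such users more than once. The result is a share of page-level user counts rather than of distinct visitors.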
WORK IN PROGRESS
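# Create circular graph that shows the proportion of users within each category (34 total categories)
library(circlize)
circos.clear()
category = annual_sortedDF$Category
# random stand-in values for now (work in progress); replace with the real per-category percentages
percent = sort(sample(40:80, 34))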
color = rev(rainbow(length(percent)))
circos.par("start.degree" = 90, cell.padding = c(0, 0, 0, 0),
           canvas.xlim=c(-1.2, 1.2), # bigger canvas?
           canvas.ylim=c(-1.2, 1.2))
circos.initialize("a", xlim = c(0, 100)) # 'a' just means there is one sector
circos.track(ylim = c(1, length(percent)+1), track.height = 0.9,
             bg.border = NA, panel.fun = function(x, y) {
  xlim = CELL_META$xlim
  circos.segments(rep(xlim[1], 34), 1:34,
                  rep(xlim[2], 34), 1:34,
                  col = "#CCCCCC")
  circos.rect(rep(0, 34), 1:34 - 0.45, percent, 1:34 + 0.45,
              col = color, border = "white")
  circos.text(rep(xlim[1], 34), 1:34,
              paste(category, " - ", percent, "%"),
              facing = "downward", adj = c(1.05, 0.5), cex = 0.8)
  breaks = seq(0, 85, by = 5)
  circos.axis(h = "top", major.at = breaks, labels = paste0(breaks, "%"),
              labels.cex = 0.6)
})